Gini Index and Cost Complexity
Which of the final nodes (or leaves) is most pure?
Which is least pure?
Could we split a node further for better purity?
Almost certainly, yes! It’s highly unlikely that all of the unused variables have exactly the same prevalence across categories.
Should we do it, or is that overfitting?
The Gini Index for a particular leaf (not overall) adds up, for each class \(k\), the class proportion times its error rate: \(\sum_k p_k(1 - p_k)\). For example:
\((0.35 \times 0.65) + (0.21 \times 0.79) + (0.14 \times 0.86) = 0.5138\)
To calculate the overall Gini Index, we average across all leaves, weighted by the number of observations in each leaf.
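A quick sanity check of the leaf calculation above, sketched in R (the class proportions are the ones from the example; `weighted_gini` is an illustrative helper, not part of any package):

```r
# Gini index for one leaf: sum of p * (1 - p) over the class proportions
gini <- function(p) sum(p * (1 - p))

# Class proportions from the example leaf above
gini(c(0.35, 0.21, 0.14))
#> [1] 0.5138

# Overall Gini: average the leaf values, weighted by leaf size
weighted_gini <- function(leaf_ginis, leaf_sizes) {
  sum(leaf_ginis * leaf_sizes) / sum(leaf_sizes)
}
```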
So, when should we split the tree further?
Only if the new splits improve the Gini Index by a certain amount.
This is the cost_complexity parameter!
But wait! This is a penalized metric, using an arbitrary penalty \(\alpha\) to avoid overfitting.
Don’t we like cross-validation better?
Well… yes.
But imagine fitting every possible tree and cross-validating each one… yikes.
We have to limit our options and cut our losses somehow!
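In tidymodels, this penalty is exactly the `cost_complexity` argument of `decision_tree()`, and we can cross-validate over a small grid of penalty values instead of over every possible tree. A sketch, assuming the usual `rpart` engine:

```r
library(tidymodels)

# Tree spec with the cost-complexity penalty left to be tuned
tree_spec <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# A grid of candidate penalty values to cross-validate over --
# far cheaper than fitting every possible tree
tree_grid <- grid_regular(cost_complexity(), levels = 10)
```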
Bagging
Suppose I took two random subsamples of my cannabis dataset:
Then I fit a decision tree to each:
How similar will the results be?
So… which tree should we believe?
Let’s take several subsamples of the data, and make trees from each.
Then, to classify a new observation, we run it through all the trees and let them vote!
(It’s a bit like a KNN for trees!)
This is called bagging, short for bootstrap aggregating.
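With the `baguette` package, a bagged tree specification looks something like this (a sketch, not the actual class code; the `Type ~ .` formula and `cannabis` data name are assumptions based on the data shown later):

```r
library(baguette)

# Bagged CART: 5 bootstrap trees, matching the "5 members" output below
bag_spec <- bag_tree() %>%
  set_engine("rpart", times = 5) %>%
  set_mode("classification")

# Fit on the full training data; each member tree sees its own
# bootstrap resample, and predictions are made by majority vote
# bag_fit <- bag_spec %>% fit(Type ~ ., data = cannabis)
```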
Caution
This step can take a while! Be patient!
What variables were most important to the trees?
parsnip model object
Bagged CART (classification with 5 members)
Variable importance scores include:
# A tibble: 63 × 4
term value std.error used
<chr> <dbl> <dbl> <int>
1 Rating 321. 6.74 5
2 Sleepy 174. 4.07 5
3 Focused 65.6 4.35 5
4 Sweet 65.5 3.93 5
5 Creative 64.8 6.08 5
6 Relaxed 64.3 8.59 5
7 Earthy 62.5 2.24 5
8 Euphoric 62.2 6.83 5
9 Uplifted 60.2 2.31 5
10 Energetic 59.9 4.53 5
# ℹ 53 more rows
Random Forests
What if some important variables are being masked by more important variables?
Remember, we have 65 predictors - yikes! So, let’s give some of the other predictors a chance to shine.
Randomly choose a set of the predictors to include in the data:
# A tibble: 2,305 × 31
Type Orange Minty Pungent Strawberry Relaxed Tree Sage Pineapple Mouth
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 hybrid 0 0 0 0 1 0 0 0 0
2 hybrid 0 0 0 0 1 0 0 0 0
3 sativa 0 0 0 0 1 0 1 0 0
4 hybrid 0 0 0 0 1 0 0 0 0
5 hybrid 1 0 0 0 1 0 0 0 0
6 indica 0 0 0 0 0 0 0 0 0
7 hybrid 0 0 1 0 1 0 0 0 0
8 indica 0 0 1 0 1 0 0 0 0
9 sativa 0 0 0 0 1 0 0 0 0
10 indica 0 0 0 0 1 0 0 0 0
# ℹ 2,295 more rows
# ℹ 21 more variables: Diesel <dbl>, Grape <dbl>, Coffee <dbl>, Lavender <dbl>,
# Mango <dbl>, Sleepy <dbl>, Tingly <dbl>, Flowery <dbl>, Sweet <dbl>,
# Creative <dbl>, Talkative <dbl>, Giggly <dbl>, Chestnut <dbl>, Skunk <dbl>,
# Tropical <dbl>, Ammonia <dbl>, Nutty <dbl>, Lime <dbl>, Dry <dbl>,
# Chemical <dbl>, Citrus <dbl>
After making many random, reduced trees, we then bag the results to end up with a random forest.
The advantage is that a wider variety of variables is involved in the process.
This way, we don’t accidentally overfit to a variable that happens to be extremely relevant to our particular dataset.
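In tidymodels, the number of randomly chosen predictors considered at each split is the `mtry` argument of `rand_forest()`. A sketch, assuming the `ranger` engine (the `mtry` and `trees` values here are illustrative placeholders):

```r
library(tidymodels)

# Random forest: each tree sees a bootstrap sample of rows AND
# only `mtry` randomly chosen predictors at each split
rf_spec <- rand_forest(mtry = 8, trees = 500) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
```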
Open the Activity-RF-Bagging.qmd activity file
Reference for Fitting Trees
Don’t forget you can use the reference guide (in the R References Module on Canvas) for guidance on how to fit these models!